19 research outputs found

    Learning To Scale Up Search-Driven Data Integration

    Get PDF
    A recent movement to tackle the long-standing data integration problem is a compositional and iterative approach, termed “pay-as-you-go” data integration. Under this model, the objective is to immediately support queries over “partly integrated” data, and to enable the user community to drive integration of the data that relate to their actual information needs. Over time, data will be gradually integrated. While the pay-as-you-go vision has been well-articulated for some time, only recently have we begun to understand how it can be realized in a system implementation. One branch of this effort has focused on enabling queries through keyword search-driven data integration, in which users pose queries over partly integrated data encoded as a graph, receive ranked answers generated from data and metadata that are linked at query time, and provide feedback on those answers. From this user feedback, the system learns to repair bad schema matches or record links. Many real-world issues of uncertainty and diversity in search-driven integration remain open. Such tasks in search-driven integration require a combination of human guidance and machine learning. The challenge is how to make maximal use of limited human input. This thesis develops three methods to scale up search-driven integration, through learning from expert feedback: (1) active learning techniques to repair links from small amounts of user feedback; (2) collaborative learning techniques to combine users’ conflicting feedback; and (3) debugging techniques to identify where data experts could best improve integration quality. We implement these methods within the Q System, a prototype of search-driven integration, and validate their effectiveness over real-world datasets.

    Room-temperature photoluminescence mediated by sulfur vacancies in 2D molybdenum disulfide

    Get PDF
    Atomic defects in monolayer transition metal dichalcogenides (TMDs) such as chalcogen vacancies significantly affect their properties. In this work, we provide a reproducible and facile strategy to rationally induce chalcogen vacancies in monolayer MoS2 by annealing at 600 °C in an argon/hydrogen (95%/5%) atmosphere. Synchrotron X-ray photoelectron spectroscopy shows that a Mo 3d5/2 core peak at 230.1 eV emerges in the annealed MoS2, associated with nonstoichiometric MoSx (0 < x < 2), and Raman spectroscopy shows an enhancement of the ∼380 cm–1 peak that is attributed to sulfur vacancies. At sulfur vacancy densities of ∼1.8 × 1014 cm–2, we observe a defect peak at ∼1.72 eV (referred to as LXD) at room temperature in the photoluminescence (PL) spectrum. The LXD peak is attributed to excitons trapped at defect-induced in-gap states and is typically observed only at low temperatures (≤77 K). Time-resolved PL measurements reveal that the lifetime of defect-mediated LXD emission is longer than that of band edge excitons, both at room and low temperatures (∼2.44 ns at 8 K). The LXD peak can be suppressed by annealing the defective MoS2 in sulfur vapor, which indicates that it is possible to passivate the vacancies. Our results provide insights into how excitonic and defect-mediated PL emissions in MoS2 are influenced by sulfur vacancies at room and low temperatures.

    Probabilistic String Similarity Joins

    No full text
    Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted to processing (deterministic) string joins and to managing probabilistic data, respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings exists only at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
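    The character-level model and the EED measure can be illustrated with a small sketch. The code below is an assumption of mine for illustration only: it represents a character-level probabilistic string as per-position character distributions and estimates EED by Monte-Carlo sampling, whereas the paper itself develops exact pruning techniques rather than a sampling estimator.

    ```python
    import random

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance between two plain strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,          # deletion
                               cur[j - 1] + 1,       # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def sample_string(cstring, rng):
        # cstring: character-level model, a list of {char: probability} dicts,
        # one dict per position. Draw one concrete realization.
        return "".join(rng.choices(list(d), weights=list(d.values()))[0]
                       for d in cstring)

    def expected_edit_distance(s1, s2, n=2000, seed=0):
        # Monte-Carlo estimate of EED: average edit distance over n sampled
        # realization pairs (an approximation, not the paper's exact method).
        rng = random.Random(seed)
        return sum(levenshtein(sample_string(s1, rng), sample_string(s2, rng))
                   for _ in range(n)) / n
    ```

    For instance, two strings that are certain at every position reduce to the ordinary edit distance, while positions with split probability mass contribute fractionally to the expectation.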

    Effect of 3-Mercaptopropyltriethoxysilane Modified Illite on the Reinforcement of SBR

    No full text
    To achieve sustainable development of the rubber industry, replacing carbon black, the most widely used but non-renewable filler produced from petroleum, is considered one of the most effective strategies. Naturally occurring illite, with its high aspect ratio, can be obtained in large amounts at low cost and with low energy consumption, so expanding its application in advanced materials is of great significance. To explore its potential as a reinforcing additive for rubber, styrene butadiene rubber (SBR) composites filled with illites of different sizes, with and without 3-mercaptopropyltriethoxysilane (KH580) modification, were studied. It was found that modification of illite by KH580 increases the K-illite/SBR interaction and thus improves the dispersion of K-illite in the SBR matrix. The better dispersion of the smaller K-illite particles, together with their stronger interfacial interaction, improves the mechanical properties of SBR remarkably, increasing the tensile strength about ninefold and the modulus more than tenfold. These results demonstrate that, beyond the evident effect of particle size, the filler–rubber interaction is of great importance to the performance of SBR composites. This may be of great significance for the potential wide use of abundant naturally occurring illite as a substitute filler in the rubber industry.

    Germinal disc region: an appropriate source for obtaining maternal DNA from eggs

    No full text
    Eggs may serve as an alternative source for DNA extraction. The quality of DNA extracted from eggshell, whole egg liquid (WEL) and the germinal disc region (GDR) was compared based on spectrophotometric, electrophoretic, PCR and reduced-representation library sequencing (RRLS) results. Although these DNAs were all invisible on the gel and could not be measured spectrophotometrically, the GDR DNA was superior to the eggshell and WEL DNA in PCR efficiency. After whole genome amplification (WGA) was introduced, the yield of GDR DNA increased significantly, and the resulting DNA was overwhelmingly superior to the eggshell and WEL DNA in the ratio of captured genome and the number of called SNPs. GDR DNA extraction followed by WGA thus provides a method to obtain sufficient DNA from a single egg.

    Active learning in keyword search-based data integration

    No full text
    The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration, where the system lazily discovers associations enabling it to join together matches to keywords, and returns ranked results. The user is expected to understand the data domain and provide feedback about answers' quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result's score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.
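    The selection problem described above can be sketched in a few lines. The following is a minimal illustration of my own, not the paper's actual model: the answer IDs and matcher scores are hypothetical, and disagreement (standard deviation) among component matchers' scores stands in as a simple proxy for how informative feedback on a result would be.

    ```python
    import statistics

    def pick_for_feedback(results, k=3):
        # results: list of (answer_id, [scores from different matchers]).
        # Rank answers by the population standard deviation of their scores,
        # so the answers the matchers disagree on most are shown for feedback.
        ranked = sorted(results,
                        key=lambda r: statistics.pstdev(r[1]),
                        reverse=True)
        return [answer_id for answer_id, _ in ranked[:k]]
    ```

    A production system would combine such an informativeness signal with predicted relevance when assembling the top results, rather than using disagreement alone.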